Revisiting Self-Training with Regularized Pseudo-Labeling for Tabular Data
Recent progress in semi- and self-supervised learning has challenged the
long-held beliefs that machine learning requires an enormous amount of labeled
data and that unlabeled data is irrelevant. Although these methods have been
successful on various data modalities, no dominant semi- or self-supervised
learning method generalizes to tabular data (i.e., most existing methods
require purpose-built tabular datasets and architectures). In this paper, we
revisit self-training, which can be applied to any kind of algorithm,
including the most widely used architecture, the gradient-boosted decision
tree, and introduce curriculum pseudo-labeling (a state-of-the-art
pseudo-labeling technique in the image domain) to the tabular domain.
Furthermore, existing pseudo-labeling techniques do not enforce the cluster
assumption when computing confidence scores for pseudo-labels generated from
unlabeled data. To overcome this issue, we propose a novel pseudo-labeling
approach that regularizes the confidence scores based on the likelihoods of
the pseudo-labels, so that more reliable pseudo-labels, which lie in
high-density regions, can be obtained. We exhaustively validate the
superiority of our approaches using various models and tabular datasets.
Comment: 10 pages for the main part and 8 extra pages for the appendix; 2 figures and 3 tables for the main part.
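The regularized pseudo-labeling idea above can be sketched in a few lines: a classifier's raw confidence on unlabeled samples is scaled by a density estimate, and a curriculum schedule admits the highest-scoring pseudo-labels first. This is an illustrative sketch only; the density model, the schedule, and all names are assumptions, not the paper's exact formulation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
labeled = rng.rand(600) < 0.1              # roughly 10% of samples labeled
X_l, y_l = X[labeled], y[labeled]
X_u = X[~labeled]

clf = LogisticRegression(max_iter=1000).fit(X_l, y_l)

# Density model over the inputs; pseudo-labels in low-density regions
# receive lower regularized confidence (the cluster assumption).
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
density = np.exp(gmm.score_samples(X_u))
density /= density.max()                   # scale densities to [0, 1]
reg_conf = clf.predict_proba(X_u).max(axis=1) * density

# Curriculum schedule: admit the top-scoring fraction first, then grow it,
# re-labeling the selected samples with the current model each round.
for frac in (0.2, 0.4, 0.6):
    k = int(frac * len(X_u))
    idx = np.argsort(-reg_conf)[:k]
    pseudo = clf.predict(X_u[idx])
    clf.fit(np.vstack([X_l, X_u[idx]]),
            np.concatenate([y_l, pseudo]))
```

Because both factors lie in [0, 1], the regularized score stays comparable across samples, and the curriculum can use a single ranking rather than per-round thresholds.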
CAST: Cluster-Aware Self-Training for Tabular Data
Self-training has gained traction because of its simplicity and
versatility, yet it is vulnerable to noisy pseudo-labels. Several studies have
proposed successful approaches to tackle this issue, but they have diminished
the advantages of self-training because they require specific modifications in
self-training algorithms or model architectures. Furthermore, most of them are
incompatible with gradient boosting decision trees, which dominate the tabular
domain. To address this, we revisit the cluster assumption, which states that
data samples that are close to each other tend to belong to the same class.
Inspired by the assumption, we propose Cluster-Aware Self-Training (CAST) for
tabular data. CAST is a simple and universally adaptable approach for enhancing
existing self-training algorithms without significant modifications.
Concretely, our method regularizes the confidence of the classifier, which
reflects the reliability of each pseudo-label, forcing pseudo-labels in
low-density regions to have lower confidence by leveraging prior knowledge for
each class within the training data. Extensive empirical evaluations on up to
20 real-world datasets confirm not only the superior performance of CAST but
also its robustness in various setups in self-training contexts.
Comment: 17 pages with appendix.
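In the same spirit, confidence regularization under the cluster assumption can be sketched as scaling a classifier's per-class confidences by class-conditional density estimates fit on the labeled data. The helper below is a hypothetical illustration; the KDE choice, the normalization, and the names are assumptions, not CAST's exact procedure:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def regularized_confidence(proba, X_unlabeled, X_by_class, bandwidth=1.0):
    """Scale per-class confidences by normalized class-conditional densities,
    so pseudo-labels in low-density regions end up with lower confidence."""
    dens = np.column_stack([
        np.exp(KernelDensity(bandwidth=bandwidth).fit(Xc)
               .score_samples(X_unlabeled))
        for Xc in X_by_class
    ])
    dens /= dens.sum(axis=1, keepdims=True)   # per-sample normalization
    return proba * dens

# Two well-separated clusters; a point at the origin should keep its
# confidence for class 0 and lose confidence for class 1.
X0 = np.random.RandomState(0).randn(50, 2)         # class-0 cluster at (0, 0)
X1 = np.random.RandomState(1).randn(50, 2) + 5.0   # class-1 cluster at (5, 5)
rc = regularized_confidence(np.array([[0.5, 0.5]]),
                            np.array([[0.0, 0.0]]), [X0, X1])
```

Because the adjustment is a post-hoc rescaling of confidences, it can wrap any classifier that emits probabilities, including gradient-boosted decision trees, without touching the model or the self-training loop.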
Revealing mammalian evolutionary relationships by comparative analysis of gene clusters
Many software tools for comparative analysis of genomic sequence data have been released in recent decades. Despite this, it remains challenging to determine evolutionary relationships in gene clusters due to their complex histories involving duplications, deletions, inversions, and conversions. One concept describing these relationships is orthology. Orthologs derive from a common ancestor by speciation, in contrast to paralogs, which derive from duplication. Discriminating orthologs from paralogs is a necessary step in most multispecies sequence analyses, but doing so accurately is impeded by the occurrence of gene conversion events. We propose a refined method of orthology assignment based on two paradigms for interpreting its definition: by genomic context or by sequence content. X-orthology (based on context) traces orthology resulting from speciation and duplication only, while N-orthology (based on content) includes the influence of conversion events.
Conversion events in gene clusters
Background: Gene clusters containing multiple similar genomic regions in close proximity are of great interest for biomedical studies because of their associations with inherited diseases. However, such regions are difficult to analyze due to their structural complexity and their complicated evolutionary histories, reflecting a variety of large-scale mutational events. In particular, conversion events can mislead inferences about the relationships among these regions, as traced by traditional methods such as construction of phylogenetic trees or multi-species alignments. Results: To correct the distorted information generated by such methods, we have developed an automated pipeline called CHAP (Cluster History Analysis Package) for detecting conversion events. We used this pipeline to analyze the conversion events that affected two well-studied gene clusters (α-globin and β-globin) and three gene clusters for which comparative sequence data were generated from seven primate species: CCL (chemokine ligand), IFN (interferon), and CYP2abf (part of cytochrome P450 family 2). CHAP is freely available at http://www.bx.psu.edu/miller_lab. Conclusions: These studies reveal the value of characterizing conversion events in the context of studying gene clusters in complex genomes.
Evaluation of methods for detecting conversion events in gene clusters
Background: Gene clusters are genetically important, but their analysis poses significant computational challenges. One of the major reasons for these difficulties is gene conversion among the duplicated regions of the cluster, which can obscure their true relationships. Many computational methods for detecting gene conversion events have been released, but their performance has not been assessed for wide deployment in evolutionary history studies due to a lack of accurate evaluation methods. Results: We designed a new method that simulates gene cluster evolution, including large-scale events of duplication, deletion, and conversion as well as small mutations. We used this simulation data to evaluate several different programs for detecting gene conversion events. Conclusions: Our evaluation identifies strengths and weaknesses of several methods for detecting gene conversion, which can contribute to more accurate analysis of gene cluster evolution.
Correction: AGAPE (Automated Genome Analysis PipelinE) for Pan-Genome Analysis of Saccharomyces cerevisiae
The characterization and public release of genome sequences from thousands of organisms is expanding the scope for genetic variation studies. However, understanding the phenotypic consequences of genetic variation remains a challenge in eukaryotes due to the complexity of the genotype-phenotype map. One approach to this is the intensive study of model systems for which diverse sources of information can be accumulated and integrated. Saccharomyces cerevisiae is an extensively studied model organism, with well-known protein functions and thoroughly curated phenotype data. To develop and expand the available resources linking genomic variation with function in yeast, we aim to model the pan-genome of S. cerevisiae. To initiate the yeast pan-genome, we newly sequenced or re-sequenced, at high quality using advanced sequencing technology, the genomes of 25 strains commonly used in the yeast research community. We also developed a pipeline for automated pan-genome analysis, which integrates the steps of assembly, annotation, and variation calling. To assign strain-specific functional annotations, we identified genes that were not present in the reference genome. We classified these according to their presence or absence across strains and characterized each group of genes with known functional and phenotypic features. The functional roles of novel genes not found in the reference genome and associated with strains or groups of strains appear to be consistent with anticipated adaptations in specific lineages. As more S. cerevisiae strain genomes are released, our analysis can be used to collate genome data and relate it to lineage-specific patterns of genome evolution. Our new tool set will enhance our understanding of genomic and functional evolution in S. cerevisiae, and will be available to the yeast genetics and molecular biology community.
DISPAQ: Distributed Profitable-Area Query from Big Taxi Trip Data
One of the crucial problems for taxi drivers is efficiently locating passengers in order to increase profits. The rapid advancement and ubiquitous penetration of Internet of Things (IoT) technology into the transportation industry enables us to provide taxi drivers with locations that have more potential passengers (more profitable areas) by analyzing and querying taxi trip data. In this paper, we propose a query processing system, called Distributed Profitable-Area Query (DISPAQ), which efficiently identifies profitable areas by exploiting the Apache Software Foundation’s Spark framework and a MongoDB database. DISPAQ first maintains a profitable-area query index (PQ-index) by extracting area summaries and route summaries from raw taxi trip data. It then identifies candidate profitable areas by searching the PQ-index during query processing, and exploits a Z-Skyline algorithm, an extension of skyline processing with a Z-order space-filling curve, to quickly refine the candidates. To improve the performance of distributed query processing, we also propose a local Z-Skyline optimization, which reduces the number of dominance tests by distributing killer profitable areas to each cluster node. Through extensive evaluation with real datasets, we demonstrate that our DISPAQ system provides a scalable and efficient solution for processing profitable-area queries over huge amounts of taxi trip data.
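The Z-order encoding underlying a Z-Skyline style refinement interleaves the bits of each coordinate into a single key, so that sorting by the key tends to keep spatially close areas together. A minimal sketch follows; DISPAQ's actual key layout is an assumption not detailed in the abstract:

```python
def z_order(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Z-order (Morton) key:
    bit i of x lands at position 2*i, bit i of y at position 2*i + 1."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# The four cells of a 2x2 grid enumerate in the characteristic "Z" shape.
cells = sorted(((x, y) for x in range(2) for y in range(2)),
               key=lambda c: z_order(*c))
```

Because the key linearizes 2D area coordinates while roughly preserving locality, candidates that are adjacent in key order are likely spatial neighbors, which is what lets a skyline pass prune dominated areas in a single sorted scan.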